
feat(cli): add Kaggle dataset integration and Croissant metadata parsing #11

Merged
VinciGit00 merged 1 commit into ScrapeGraphAI:main from closestfriend:feature/kaggle-integration on Feb 6, 2026

Conversation

@closestfriend
Contributor

Description

Add Kaggle dataset integration and Croissant (ML Commons) metadata parsing to streamline dataset-to-TOON workflows. This enables users to download Kaggle datasets and convert them to TOON format in a single command.

Features

New CLI flags:

  • --kaggle - Treat input as Kaggle dataset slug
  • --croissant - Parse input as Croissant JSON-LD metadata
  • --file / -f - Select specific file from multi-file datasets

Usage examples:

# Download Kaggle dataset and convert to TOON
toon username/dataset-name --kaggle --stats

# Select specific file from dataset
toon username/dataset-name --kaggle --file data.csv

# Parse Croissant metadata to see schema
toon metadata.json --croissant

New Python API:

import json
from toon import download_dataset, find_best_csv, csv_to_records, parse_croissant

# Download and process Kaggle dataset
files = download_dataset("username/dataset-name")
csv_file = find_best_csv(files)
records = csv_to_records(csv_file.read_text())

# Parse Croissant metadata
with open("metadata.json") as f:
    metadata = json.load(f)
info = parse_croissant(metadata)
print(info['schema'])

Implementation

New module toon/kaggle.py provides:

  • download_dataset() - Download Kaggle datasets via kaggle CLI
  • find_best_csv() - Heuristic selection of main data file
  • csv_to_records() - CSV to list[dict] conversion
  • parse_croissant() - Extract schema from Croissant JSON-LD
  • croissant_to_summary() - Generate human-readable summaries
  • is_kaggle_slug() - Detect Kaggle dataset slug format

All imports are optional; the CLI gracefully degrades if the kaggle package is not installed. A rough sketch of the slug detection listed above is shown below.
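
A minimal sketch of what is_kaggle_slug() might look like (the function name matches the module's API; the exact pattern used in toon/kaggle.py is an assumption):

import re

def is_kaggle_slug(text: str) -> bool:
    """Heuristic check: does text look like a Kaggle slug (username/dataset-name)?"""
    # Assumed pattern: exactly one slash, with alphanumeric/dash/underscore/dot segments.
    return bool(re.fullmatch(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+", text))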

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • All tests pass
  • Added 12 new tests for Kaggle integration
  • Tested manually with real Kaggle datasets

Checklist

  • Code follows the project's style guidelines
  • Self-review completed
  • Documentation updated (CLI help, docstrings)
  • No new warnings or errors introduced

Add new --kaggle and --croissant CLI flags for streamlined dataset workflows:

- `toon username/dataset --kaggle` downloads and converts Kaggle datasets to TOON
- `toon metadata.json --croissant` parses ML Commons Croissant metadata
- `--file` flag to select specific files from multi-file datasets
- Auto-detection of Kaggle slugs (username/dataset-name format)

New module toon/kaggle.py provides:
- download_dataset(): Download Kaggle datasets via kaggle CLI
- find_best_csv(): Heuristic selection of main data file
- csv_to_records(): CSV to list[dict] conversion
- parse_croissant(): Extract schema from Croissant JSON-LD
- croissant_to_summary(): Generate human-readable dataset summaries

All imports are optional; the CLI gracefully degrades if the kaggle
package is not installed.

Includes comprehensive test suite (12 tests, 100% pass).
Copilot AI left a comment

Pull request overview

Adds Kaggle dataset download support and Croissant (ML Commons) JSON-LD parsing to the TOON tooling so users can go from dataset metadata/slug to TOON output via the CLI (and via a small Python API surface).

Changes:

  • Introduces toon/kaggle.py with Kaggle CLI download utilities, CSV selection/conversion, and Croissant metadata parsing/summary helpers.
  • Extends toon CLI with --kaggle, --croissant, and --file/-f flows to download/parse and then encode to TOON.
  • Exposes Kaggle/Croissant helpers from toon/__init__.py and adds unit tests for the new module.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File                  Description
toon/kaggle.py        New Kaggle/Croissant helper module (download via kaggle CLI, CSV heuristics, Croissant schema extraction & summary).
toon/cli.py           Adds new CLI flags and execution paths for Kaggle downloads and Croissant parsing.
toon/__init__.py      Exports Kaggle/Croissant helpers as part of the public API (attempts "optional" gating).
tests/test_kaggle.py  Adds tests for slug detection, CSV conversion, Croissant parsing/summary, and CSV selection heuristic.


Comment on lines +14 to +26
try:
    from .kaggle import (
        is_kaggle_slug,
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
    )
    KAGGLE_AVAILABLE = True
except ImportError:
    KAGGLE_AVAILABLE = False

Copilot AI Feb 6, 2026

KAGGLE_AVAILABLE is set based on importing toon.kaggle, but that module has no external imports, so this will be True even when the Kaggle CLI isn’t installed/configured. This makes the if not KAGGLE_AVAILABLE: branches effectively dead code and the error message about needing the “kaggle package” misleading. Consider removing this import-gating entirely and instead detect the kaggle executable (e.g., via shutil.which('kaggle')) or rely on download_dataset() raising a clear error, and update the messaging to refer to the Kaggle CLI/credentials rather than the Python package.

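A minimal sketch of the executable-based check suggested here (not code from this PR; a possible alternative gate):

import shutil

def kaggle_cli_available() -> bool:
    """True if the `kaggle` executable is on PATH; credentials are not verified."""
    return shutil.which("kaggle") is not None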


# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):

Copilot AI Feb 6, 2026


This condition enables implicit Kaggle-slug auto-detection (treating args.input as Kaggle when it matches username/dataset), even if the user didn’t pass --kaggle. This behavior isn’t described in the PR description/CLI help and can change semantics for relative paths like data/user/file that don’t exist yet. Either require --kaggle explicitly or document the auto-detection behavior clearly (and consider making it opt-in).

Comment on lines +224 to +305
# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):
    if not KAGGLE_AVAILABLE:
        print('Error: Kaggle support requires the kaggle package. '
              'Install with: pip install kaggle', file=sys.stderr)
        return 1

    try:
        print(f'Downloading Kaggle dataset: {args.input}', file=sys.stderr)
        files = download_dataset(args.input)

        # Find the target file
        if args.select_file:
            target = next(
                (f for f in files if args.select_file in f.name),
                None
            )
            if not target:
                print(f'Error: No file matching "{args.select_file}" in dataset',
                      file=sys.stderr)
                print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
                return 1
        else:
            target = find_best_csv(files)
            if not target:
                # Try JSON files
                json_files = [f for f in files if f.suffix.lower() == '.json']
                target = json_files[0] if json_files else None

        if not target:
            print('Error: No CSV or JSON files found in dataset', file=sys.stderr)
            return 1

        print(f'Using: {target.name}', file=sys.stderr)

        # Read and convert
        content = target.read_text(encoding='utf-8', errors='replace')

        if target.suffix.lower() == '.csv':
            data = csv_to_records(content)
        else:
            data = json.loads(content)

        # Encode to TOON
        options = {
            'delimiter': args.delimiter,
            'indent': args.indent,
            'key_folding': args.key_folding,
        }
        if args.flatten_depth is not None:
            options['flatten_depth'] = args.flatten_depth

        output_content = encode(data, options)
        input_content = json.dumps(data)  # For stats comparison

        # Show statistics if requested
        if args.stats:
            input_tokens = count_tokens(input_content)
            output_tokens = count_tokens(output_content)

            print(f'Input (JSON): {len(input_content)} bytes', file=sys.stderr)
            print(f'Output (TOON): {len(output_content)} bytes', file=sys.stderr)
            if len(input_content) > 0:
                print(f'Size reduction: {(1 - len(output_content) / len(input_content)) * 100:.1f}%',
                      file=sys.stderr)

            if input_tokens is not None and output_tokens is not None:
                print(f'Input tokens: {input_tokens}', file=sys.stderr)
                print(f'Output tokens: {output_tokens}', file=sys.stderr)
                print(f'Token reduction: {(1 - output_tokens / input_tokens) * 100:.1f}%',
                      file=sys.stderr)
            else:
                print('(Install tiktoken for token statistics)', file=sys.stderr)

            print('---', file=sys.stderr)

        write_output(output_content, args.output)
        return 0

    except Exception as e:
        print(f'Error: {e}', file=sys.stderr)
        return 1

Copilot AI Feb 6, 2026


New Kaggle/Croissant CLI flows added here aren’t covered by tests. Since tests/test_cli.py already exists, it would be good to add tests for --kaggle/--croissant behavior (mocking subprocess.run / download_dataset, and verifying file selection + output) to prevent regressions.

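As a sketch of the kind of test this asks for (the patch target and the CLI entry-point signature are assumptions, not code from this PR):

from unittest.mock import patch

def test_kaggle_flag_converts_downloaded_csv(tmp_path):
    # Fake "downloaded" dataset: a single local CSV, no network access.
    csv_path = tmp_path / "data.csv"
    csv_path.write_text("id,name\n1,alice\n2,bob\n")

    with patch("toon.cli.download_dataset", return_value=[csv_path]):
        from toon.cli import main  # assumed entry point taking an argv list
        exit_code = main(["username/dataset-name", "--kaggle"])

    assert exit_code == 0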

import csv
import io
import json

Copilot AI Feb 6, 2026


import json is unused (only referenced in a docstring example). If the project enforces linting, this will fail; otherwise it’s still noise. Consider removing the import or using json in actual code.

Suggested change (delete this line):
import json

Comment on lines +71 to +73
if output_dir is None:
    output_dir = tempfile.mkdtemp(prefix="toon_kaggle_")


Copilot AI Feb 6, 2026


When output_dir is not provided, this creates a temp directory via mkdtemp() that is never cleaned up by the library or the CLI path (leaving behind toon_kaggle_* directories). Consider using tempfile.TemporaryDirectory() in the CLI flow (so it’s cleaned up), or documenting/returning the temp directory path and providing a cleanup option.

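One possible shape for the CLI-side cleanup, as a sketch that assumes download_dataset accepts the output_dir argument shown in the snippet above:

import tempfile

# Sketch: the downloaded files live in a temp directory that is removed on exit.
with tempfile.TemporaryDirectory(prefix="toon_kaggle_") as tmp_dir:
    files = download_dataset(args.input, output_dir=tmp_dir)
    # ... pick a file, read it, and encode to TOON before the directory is deleted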
Comment on lines +237 to +245
target = next(
    (f for f in files if args.select_file in f.name),
    None
)
if not target:
    print(f'Error: No file matching "{args.select_file}" in dataset',
          file=sys.stderr)
    print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
    return 1

Copilot AI Feb 6, 2026


File selection uses substring matching (args.select_file in f.name) and returns the first match from an arbitrary rglob() ordering. This can select the wrong file when multiple names contain the substring (e.g., data.csv vs mydata.csv). Prefer exact matches first (e.g., f.name == args.select_file), and if multiple matches exist, surface an error listing candidates.

Suggested change (replacing the quoted block above):
# Prefer exact filename matches first
exact_matches = [f for f in files if f.name == args.select_file]
if len(exact_matches) == 1:
    target = exact_matches[0]
elif len(exact_matches) > 1:
    print(
        f'Error: Multiple files named "{args.select_file}" found in dataset:',
        file=sys.stderr,
    )
    for f in exact_matches:
        print(f' - {f}', file=sys.stderr)
    return 1
else:
    # Fall back to substring matches, but require them to be unambiguous
    substring_matches = [f for f in files if args.select_file in f.name]
    if len(substring_matches) == 1:
        target = substring_matches[0]
    elif len(substring_matches) > 1:
        print(
            f'Error: Multiple files matching "{args.select_file}" in dataset:',
            file=sys.stderr,
        )
        for f in substring_matches:
            print(f' - {f}', file=sys.stderr)
        return 1
    else:
        print(
            f'Error: No file matching "{args.select_file}" in dataset',
            file=sys.stderr,
        )
        print(
            f'Available files: {[f.name for f in files]}',
            file=sys.stderr,
        )
        return 1

Comment on lines +214 to +218
data_types = field.get("dataType", ["unknown"])
type_str = data_types[0] if data_types else "unknown"
# Clean up schema.org prefixes
type_str = type_str.replace("sc:", "").replace("https://schema.org/", "")


Copilot AI Feb 6, 2026


field.get('dataType') is treated as a list (data_types[0]), but in JSON-LD it can also be a single string. If dataType is a string, this will take the first character (e.g., 'sc:Integer' -> 's') and produce an incorrect schema. Normalize dataType to a list (or handle str explicitly) before selecting the first type.

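A sketch of the normalization suggested here, accepting either a single string or a list for the JSON-LD dataType value (helper name is hypothetical):

def normalize_data_type(field: dict) -> str:
    """Return a cleaned type name whether JSON-LD dataType is a string or a list."""
    raw = field.get("dataType", ["unknown"])
    data_types = [raw] if isinstance(raw, str) else (raw or ["unknown"])
    # Strip schema.org prefixes, matching the cleanup in the quoted snippet.
    return data_types[0].replace("sc:", "").replace("https://schema.org/", "")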
Comment on lines +26 to +50
# Kaggle integration (optional - requires kaggle installation)
try:
    from .kaggle import (
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
        is_kaggle_slug,
    )
    _KAGGLE_AVAILABLE = True
except ImportError:
    _KAGGLE_AVAILABLE = False
    def download_dataset(*args, **kwargs):
        raise ImportError("download_dataset requires kaggle to be installed. Please install kaggle to use this feature.")
    def find_best_csv(*args, **kwargs):
        raise ImportError("find_best_csv requires kaggle to be installed. Please install kaggle to use this feature.")
    def csv_to_records(*args, **kwargs):
        raise ImportError("csv_to_records requires kaggle to be installed. Please install kaggle to use this feature.")
    def parse_croissant(*args, **kwargs):
        raise ImportError("parse_croissant requires kaggle to be installed. Please install kaggle to use this feature.")
    def croissant_to_summary(*args, **kwargs):
        raise ImportError("croissant_to_summary requires kaggle to be installed. Please install kaggle to use this feature.")
    def is_kaggle_slug(*args, **kwargs):
        raise ImportError("is_kaggle_slug requires kaggle to be installed. Please install kaggle to use this feature.")

Copilot AI Feb 6, 2026


The try/except ImportError gating here is misleading: toon.kaggle is part of this package and only uses stdlib imports, so this import will succeed regardless of whether the user has the Kaggle CLI installed. As a result, the fallback stubs will never be used, and the error messages about needing to “install kaggle” don’t reflect the actual runtime dependency (the kaggle executable + credentials). Consider removing the ImportError gating and documenting/checking for the Kaggle CLI instead, or explicitly gating on shutil.which('kaggle') if you want a true “availability” switch.

Comment on lines +309 to +310
if not KAGGLE_AVAILABLE:
    print('Error: Croissant support requires the kaggle module.', file=sys.stderr)

Copilot AI Feb 6, 2026


The --croissant path is gated on KAGGLE_AVAILABLE and the error message says it “requires the kaggle module”, but Croissant parsing here is pure-stdlib (json + parse_croissant) and shouldn’t require the Kaggle CLI. Consider decoupling Croissant support from any Kaggle availability checks and updating the message accordingly (or dropping the check entirely).

Suggested change (replacing the quoted block above):
# Croissant support depends on the availability of parse_croissant,
# but does not require the Kaggle CLI itself.
if 'parse_croissant' not in globals():
    print('Error: Croissant support is not available in this installation.', file=sys.stderr)

@@ -0,0 +1,190 @@
"""Tests for Kaggle integration module."""

import pytest

Copilot AI Feb 6, 2026


Import of 'pytest' is not used.

Suggested change (delete this line):
import pytest

@VinciGit00 (Contributor) left a comment


LGTM

VinciGit00 merged commit b9a5204 into ScrapeGraphAI:main on Feb 6, 2026
8 checks passed
@github-actions

github-actions bot commented Feb 6, 2026

🎉 This PR is included in version 1.6.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
